Natural Language Processing with sklearn, Part 2

From 大邓大邓和他的Python 2019-04-26

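We previously shared 《用sklearn做自然语言处理-1》 (Natural Language Processing with sklearn, Part 1); today we continue with some more advanced sklearn usage. We will explore a dataset of Yelp reviews and, beyond the feature extraction covered in the previous post, work through several further steps.

This post focuses on feature engineering and the steps around it: feature extraction, building a pipeline (often rendered in Chinese as 油管, "oil pipe"; I personally think 流水线, "assembly line", is more accurate), building custom transformers, feature union, dimension reduction, and parameter tuning (grid search). Feature engineering is a very important step in natural language processing and machine learning, so this post spends a fair amount of time on it.

The previous post used the 20 Newsgroups dataset. Today we use the Yelp review dataset, which I have already converted to a CSV file so that it can be read with pandas. The dataset has five fields, stars, text, funny, useful, and cool, and the task is to predict the stars value of each review.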

1. Loading the Yelp review dataset

import pandas as pd

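# Note: passing names= without header=0 appears to make pandas read the file's
# own header row as a data row (see the extra 'stars' value in the counts
# further down); adding header=0 would skip it.
dataset = pd.read_csv('yelp-review-subset.csv', 
                      delimiter=',', 
                      names=['stars', 'text', 'funny', 'useful', 'cool'])

# Look at 5 random rows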
print(dataset.sample(5))

Output

         stars                                               text funny useful  \
    37       3  I wouldnt say the best fish sandwich ive ever ...     0      0   
    1635     2  The order process went well and the delivery a...     0      0   
    1836     3  Checked out the neighborhood restaurant for Su...     0      0   
    1040     4  Dining outdoors after a bike ride on the rail ...     0      1   
    1719     5  If I could give Dr. Bigger 10 stars I would! H...     0      0   
         cool  
    37      0  
    1635    0  
    1836    0  
    1040    0  
    1719    1

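Next, let's look at how the data is distributed.

print('The Yelp review dataset has {0} distinct stars values and {1} reviews in total'.format(len(dataset.stars.unique()), len(dataset)))

# Distinct values of stars
print(dataset.stars.unique())
# Distribution of stars
print(dataset.stars.value_counts())

Output

    The Yelp review dataset has 6 distinct stars values and 2501 reviews in total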
    ['stars' '4' '5' '3' '2' '1']
    1        500
    5        500
    4        500
    2        500
    3        500
    stars      1
    Name: stars, dtype: int64
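
The stray stars entry and the total of 2501 rows (5 × 500 + 1) above are consistent with the CSV's own header row having been read as a data row. Here is a minimal sketch of that effect, using a small in-memory CSV rather than the real file; the two sample reviews are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-review CSV with a header row, standing in for the real file
csv_text = (
    "stars,text,funny,useful,cool\n"
    "5,Great food,0,1,0\n"
    "2,Slow service,1,0,0\n"
)
cols = ['stars', 'text', 'funny', 'useful', 'cool']

# names= alone: the header line is kept as the first data row
with_header_row = pd.read_csv(io.StringIO(csv_text), names=cols)

# header=0 plus names=: the header line is replaced by our names and dropped from the data
clean = pd.read_csv(io.StringIO(csv_text), header=0, names=cols)

print(len(with_header_row), with_header_row['stars'].dtype)
print(len(clean), clean['stars'].dtype)
```

With header=0 the stars column parses as integers, so an extra stars class would not appear in later steps.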

Next we need to split the data into a training set and a test set.

2. Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

# The test set takes 30% of the data; rows are shuffled, with a fixed random state
X_train, X_test, y_train, y_test = train_test_split(dataset[['text', 'funny', 'useful', 'cool']],  
                                                    dataset['stars'], 
                                                    random_state= 100,
                                                    test_size=0.3)

# Look at the first 5 rows
print(X_train.head())

Output

                                                       text funny useful cool
    1818  If you go there, be very careful with the pric...     0      1    0
    751   I am not a fan of chain restaurants typically ...     0      3    2
    545   Love dukes and love the food. I wish i could f...     0      0    0
    198   Tonya is super sweet and the front desk people...     0      0    0
    1439  Ate here last night...fantastic food & service...     0      0    0
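
To make the effect of test_size concrete, here is a small sketch on made-up data (10 invented rows, not the Yelp file). By the same rule, 2501 rows with test_size=0.3 give 751 test rows and 1750 training rows, which matches the shapes and support totals printed elsewhere in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, purely to show how the rows are divided
toy = pd.DataFrame({'text': ['review %d' % i for i in range(10)],
                    'stars': [1, 2, 3, 4, 5] * 2})

X_tr, X_te, y_tr, y_te = train_test_split(toy[['text']], toy['stars'],
                                          random_state=100, test_size=0.3)

# Number of training rows and test rows
print(len(X_tr), len(X_te))
# The split keeps features and labels aligned by index
print(list(X_tr.index) == list(y_tr.index))
```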

X_train now contains three numeric columns (funny, useful, cool), a form of data that sklearn handles well. We still need to extract features from text.

print(X_train.columns)

print(type(X_train))
print(type(X_train.text))

Output

    Index(['text', 'funny', 'useful', 'cool'], dtype='object')
    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.series.Series'>

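Now we extract features from text using CountVectorizer. If this is unfamiliar, see the earlier post 《如何从文本中提取特征信息》 (How to Extract Feature Information from Text).

3. Feature extraction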

from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()

# For Chinese text, segment it into words first and pass in strings with the words separated by spaces
X_train_counts = counter.fit_transform(X_train.text)

# 1750 rows; the feature space has 11193 feature words
print(X_train_counts.shape)

Output

    (1750, 11193)
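
To see what the matrix contains, here is a small sketch on two made-up sentences (not taken from the dataset). The names toy_counter and toy_counts are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, for illustration only
docs = ["the food was great", "the service was slow"]

toy_counter = CountVectorizer()
toy_counts = toy_counter.fit_transform(docs)

# vocabulary_ maps each word to its column index (assigned in alphabetical order)
print(sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get))
# One row per document, one column per word, each cell a word count
print(toy_counts.toarray())
```

Each row is one document and each column one vocabulary word, which is why the real matrix has 1750 rows and 11193 columns.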

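X_train_counts now holds the feature space extracted from text, but how do we combine these text features with funny, useful, and cool?

4. Pipeline

A pipeline consists of several transformers (which prepare the data), and its final step is normally an estimator (which trains the model). If that is still hazy, take a driver burning gasoline as an example: the driver consuming the gasoline is the final step, and everything before it, extracting the crude oil, transporting it, and refining it, is the work of transformers.

A sklearn Pipeline is written roughly like this (pseudocode):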

pipeline = Pipeline([
                    ('step1', PumpCrudeOilFromField()),
                    ('step2', RefineCrudeOilIntoGasoline()),
                    ('step3', FillUpAtGasStation()),
                    ('step4', DriverBurnsGasoline())
                   ])

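Here step1, step2, and step3 are all non-consuming stages, but only when upstream and downstream fit together properly can they produce anything jointly. The final product of that joint production is only consumed at step4. So we call the first three steps transformers, while the last step, step4, is the estimator.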
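
Leaving the analogy aside, a minimal runnable sketch of the same shape might look like this. The texts and labels are made up, and the step names counts, tfidf, and clf are arbitrary choices for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only
texts = ["great food", "awful service", "great service", "awful food"]
labels = ["pos", "neg", "pos", "neg"]

toy_pipeline = Pipeline([
    ('counts', CountVectorizer()),    # transformer: text -> word counts
    ('tfidf', TfidfTransformer()),    # transformer: counts -> tf-idf weights
    ('clf', LogisticRegression()),    # estimator: learns from the final matrix
])

# fit() calls fit_transform on each transformer in turn, then fit on the estimator
toy_pipeline.fit(texts, labels)

# predict() pushes new text through the same transformers before the estimator
print(toy_pipeline.predict(["great food"]))
```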

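Industrial mass production works because society has a complete system of standards. In the oil example, each department only needs to know its own input and output, so it is enough to standardize the input (fit) and the output (transform). For sklearn to understand that a step is part of the data-preparation flow, every step before step4 has to follow one agreed standard: each inherits from the two classes TransformerMixin and BaseEstimator (the counterpart of the standards the whole of society follows in industrial production) and exposes an input method fit and an output method transform. fit and transform may do nothing to the material or may do real work, but both methods must exist for sklearn to treat the step as a transformer.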

from sklearn.base import TransformerMixin, BaseEstimator

class ItemSelector(TransformerMixin, BaseEstimator):
    """A transformer that selects one column (or several columns) from a DataFrame."""
    def __init__(self, keys):
        self.keys = keys

    # fit: the "input" side of ItemSelector; there is nothing to learn
    def fit(self, x, y=None):
        return self

    # transform: the "output" side of ItemSelector
    def transform(self, dataframe):
        return dataframe[self.keys]


class VotesToDictTransformer(TransformerMixin, BaseEstimator):
    """A transformer that turns the [funny, useful, cool] columns into a list of dicts."""

    # fit: the "input" side of VotesToDictTransformer; there is nothing to learn
    def fit(self, x, y=None):
        return self

    # transform: the "output" side of VotesToDictTransformer
    def transform(self, votes):
        funny, useful, cool = votes['funny'], votes['useful'], votes['cool']
        return [{'funny': f, 'useful': u, 'cool': c } 
                 for f, u, c in zip(funny, useful, cool)]

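DictVectorizer is also a transformer; it simply reshapes data from a list of dicts into a feature matrix. Once the data has been cleaned up to this point it can be handed to the estimator (whose job is to let the model learn the patterns in the data). Let's first see what DictVectorizer does: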

from sklearn.feature_extraction import DictVectorizer

dictV = DictVectorizer()

data = dictV.fit_transform([{'a':3, 'b':5},{'a':0,'b':2},{'a':12,'b':3}])

print(data.toarray())

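# On scikit-learn 1.2 and later, get_feature_names() no longer exists; use get_feature_names_out() there
print(dictV.get_feature_names())

Output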

    [[ 3.  5.]
     [ 0.  2.]
     [12.  3.]]
    ['a', 'b']
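
Chaining the two custom transformers from earlier with DictVectorizer gives the numeric branch of the model. Here is a minimal sketch on two made-up rows; the DataFrame toy and the name votes_branch are invented for this illustration.

```python
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one or more columns from a DataFrame (same idea as in the post)."""
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        return dataframe[self.keys]


class VotesToDictTransformer(TransformerMixin, BaseEstimator):
    """Turn the funny/useful/cool columns into a list of dicts."""
    def fit(self, x, y=None):
        return self

    def transform(self, votes):
        return [{'funny': f, 'useful': u, 'cool': c}
                for f, u, c in zip(votes['funny'], votes['useful'], votes['cool'])]


# Made-up rows, for illustration only
toy = pd.DataFrame({'text': ['nice place', 'too pricey'],
                    'funny': [0, 1], 'useful': [1, 3], 'cool': [0, 2]})

votes_branch = Pipeline([
    ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
    ('votes_to_dict', VotesToDictTransformer()),
    ('vectorizer', DictVectorizer()),
])

votes_matrix = votes_branch.fit_transform(toy)
# DictVectorizer orders its columns alphabetically by key
print(votes_branch.named_steps['vectorizer'].feature_names_)
print(votes_matrix.toarray())
```

Because this pipeline ends in a transformer rather than an estimator, it exposes fit_transform and can itself be used as one step of a larger structure.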

5. FeatureUnion

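This leads to the next topic, feature union (FeatureUnion): treat the contribution a of text to stars and the contribution b of funny, useful, and cool to stars separately, then give a and b certain proportions or weights to obtain a model of how all of the data affects stars. More precisely, FeatureUnion runs several transformers on the same input and concatenates their output matrices side by side, optionally scaling each block by a weight.

FeatureUnion and Pipeline can be nested inside each other. To make this easier to follow, I'll stay with the driving example. Suppose our car is a hybrid and we want to know its running cost per kilometer: we need information about gasoline, information about electricity, and the ratio in which the two are consumed.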

FeatureUnion(
             transformer_list=[
               ('gasoline', Pipeline([
                    ('step1', PumpCrudeOilFromField()),
                    ('step2', RefineCrudeOilIntoGasoline()),
                    ('step3', FillUpAtGasStation()),
                ])),
                ('electricity', Pipeline([
                    ('step_1', MineCoal()),
                    ('step_2', CoalPowerPlant()),
                    ('step_3', ChargingStation()),
                ])),
             ],
             # energy consumption ratio
             transformer_weights={
                'gasoline': 3.0,
                'electricity': 1
                },
             )
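
The analogy above can be made concrete with a minimal runnable sketch. It uses two stock CountVectorizer instances in place of the post's custom branches; the documents and the branch names words and chars are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Made-up reviews, for illustration only
docs = ["good food", "bad food"]

toy_union = FeatureUnion(
    transformer_list=[
        ('words', CountVectorizer()),                 # word counts
        ('chars', CountVectorizer(analyzer='char')),  # single-character counts
    ],
    # every value in the 'words' block is multiplied by 3.0
    transformer_weights={'words': 3.0, 'chars': 1.0},
)

combined = toy_union.fit_transform(docs)

# The first three columns are the word block (bad, food, good), scaled by 3
print(combined.toarray()[:, :3])
# The remaining columns are the character block, left unscaled
print(combined.shape)
```

The keys of transformer_weights have to match the names given in transformer_list, since that is how each weight is paired with its block.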

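Taken as a whole, a FeatureUnion still behaves as a transformer, so a FeatureUnion and an estimator can be nested together inside another Pipeline. That was a tiring explanation, and I'm not sure whether FeatureUnion and Pipeline have come across clearly. Let's go straight to the code. It is quite long, but it is enough to get a rough idea of what each part does; the details can be digested later.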

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report

class ItemSelector(TransformerMixin, BaseEstimator):
    """A transformer that selects one column (or several columns) from a DataFrame."""

    def __init__(self, keys):
        self.keys = keys

    # fit: the "input" side of ItemSelector
    def fit(self, x, y=None):
        return self

    # transform: the "output" side of ItemSelector
    def transform(self, dataframe):
        return dataframe[self.keys]


class VotesToDictTransformer(TransformerMixin, BaseEstimator):
    """A transformer that turns the [funny, useful, cool] columns into a list of dicts."""

    # fit: the "input" side of VotesToDictTransformer
    def fit(self, x, y=None):
        return self

    # transform: the "output" side of VotesToDictTransformer
    def transform(self, votes):
        funny, useful, cool = votes['funny'], votes['useful'], votes['cool']
        return [{'funny': f, 'useful': u, 'cool': c } for f, u, c in zip(funny, useful, cool)]


model = Pipeline([
    # Use FeatureUnion to combine text with the votes
    ('union', FeatureUnion(
        transformer_list=[

            # Extract a feature matrix from the text
            ('bagofwords', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', TfidfVectorizer()),
            ])),

            # Take features from 'funny', 'useful', 'cool',
            # convert them to dict form, and
            # use DictVectorizer to turn the dicts into a matrix
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),

        ],

        # Weight each group of features
        # (the keys must match the names used in transformer_list)
        transformer_weights={
            'bagofwords': 3.0,
            'votes': 1
        },

    )),

    # Apply LogisticRegression to the combined feature matrix
    ('clf', LogisticRegression()),
])


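# Train the model on the training set
model.fit(X_train, y_train)

# Predict on the test set
predicted = model.predict(X_test)

# Print a classification report to see how the model performs on the test set.
# Note: sklearn's signature is classification_report(y_true, y_pred); with the
# arguments in this order, precision and recall appear swapped in the table
# below and support counts predictions rather than true labels.
print(classification_report(predicted, y_test))

Output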

                 precision    recall  f1-score   support
              1       0.69      0.63      0.65       177
              2       0.51      0.44      0.48       165
              3       0.33      0.39      0.36       127
              4       0.29      0.41      0.34       111
              5       0.68      0.54      0.61       171
          stars       0.00      0.00      0.00         0
    avg / total       0.53      0.50      0.51       751
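
classification_report takes the true labels first and the predictions second. Here is a small sketch, on made-up labels, of what changes when the two are exchanged as in the call above; the variable names are chosen only for this illustration.

```python
from sklearn.metrics import classification_report

# Made-up labels, for illustration only
y_true = [1, 1, 1, 2]
y_pred = [1, 1, 2, 2]

# Documented order: true labels first, predictions second
report = classification_report(y_true, y_pred, output_dict=True)
# Same data with the two arguments exchanged
swapped = classification_report(y_pred, y_true, output_dict=True)

# Precision, recall, and support for class 1 under each ordering
print(report['1']['precision'], report['1']['recall'], report['1']['support'])
print(swapped['1']['precision'], swapped['1']['recall'], swapped['1']['support'])
```

Exchanging the arguments exchanges precision with recall for every class and makes support count how often each class was predicted, which would explain why the per-class support values in the table above are uneven even though the dataset is balanced.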

6. Grid Search

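In the example above we used TfidfVectorizer and LogisticRegression with their default settings, but both have hyperparameters that we need to tune to find the best values. This is where GridSearchCV comes in.

The model above contains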

('clf', LogisticRegression())

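In this entry, clf is the name given to LogisticRegression. So when we want the grid search to vary LogisticRegression's max_iter and C parameters, we write clf__max_iter and clf__C, which tells sklearn that we are tuning the LogisticRegression step named clf.

The model above also contains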

('counts', TfidfVectorizer())

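Because this entry is nested several levels deep, its full name is rather long: union__bagofwords__counts refers to the TfidfVectorizer. So to vary TfidfVectorizer's max_df parameter in the grid search, we write union__bagofwords__counts__max_df, which tells sklearn that we are tuning max_df on the TfidfVectorizer named union__bagofwords__counts.

Note that the levels of the name are joined by double underscores.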
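
One way to avoid typing these names by hand is to ask the model for them with get_params(). Here is a minimal sketch using a stripped-down stand-in for the model with the same step names; toy_model is a name introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# A stripped-down stand-in for the model above, with the same step names
toy_model = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('bagofwords', Pipeline([
            ('counts', TfidfVectorizer()),
        ])),
    ])),
    ('clf', LogisticRegression()),
])

# get_params() lists every tunable name, nested levels joined by "__"
param_names = sorted(toy_model.get_params().keys())
print([name for name in param_names
       if name.endswith('max_df') or name.startswith('clf__C')])
```

Any name that appears in this listing can be used as a key in the param_grid passed to GridSearchCV.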

from sklearn.model_selection import GridSearchCV


params = dict(clf__max_iter=[50, 80, 100], 
              clf__C=[0.1, 0.2, 0.3],
              union__bagofwords__counts__max_df = [0.5, 0.6, 0.7, 0.8])

grid_search = GridSearchCV(model, param_grid=params)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

Output

    {'clf__C': 0.3, 'clf__max_iter': 50, 'union__bagofwords__counts__max_df': 0.7}

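According to the grid search, among the combinations tried the model performs best when LogisticRegression has max_iter=50 and C=0.3 and TfidfVectorizer has max_df=0.7.

So next we set C and max_iter on LogisticRegression and max_df=0.7 on TfidfVectorizer in the model, and see whether accuracy improves.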

model = Pipeline([
    # Use FeatureUnion to combine text with the votes
    ('union', FeatureUnion(
        transformer_list=[

            # Extract a feature matrix from the text
            ('bagofwords', Pipeline([
                ('selector', ItemSelector(keys='text')),
                ('counts', TfidfVectorizer(max_df = 0.7)),
            ])),

            # Take features from 'funny', 'useful', 'cool',
            # convert them to dict form, and
            # use DictVectorizer to turn the dicts into a matrix
            ('votes', Pipeline([
                ('selector', ItemSelector(keys=['funny', 'useful', 'cool'])),
                ('votes_to_dict', VotesToDictTransformer()),
                ('vectorizer', DictVectorizer()),
            ])),

        ],

        # Weight each group of features
        # (the keys must match the names used in transformer_list)
        transformer_weights={
            'bagofwords': 3.0,
            'votes': 1
        },

    )),

    # Apply LogisticRegression to the combined feature matrix
    ('clf', LogisticRegression(C=0.3, max_iter=50)),
])


model.fit(X_train, y_train)
predicted = model.predict(X_test)
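# predicted is passed first here, so precision and recall appear swapped in the
# table below (the signature is classification_report(y_true, y_pred))
print(classification_report(predicted, y_test))

Output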

                 precision    recall  f1-score   support
              1       0.72      0.66      0.69       177
              2       0.48      0.49      0.48       139
              3       0.37      0.39      0.38       143
              4       0.34      0.41      0.38       131
              5       0.70      0.59      0.64       161
          stars       0.00      0.00      0.00         0
    avg / total       0.54      0.52      0.53       751
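
As an alternative to rebuilding the model by hand with the tuned values, GridSearchCV refits the best combination by default and exposes it as best_estimator_. Here is a minimal sketch on made-up reviews with a much smaller pipeline than the one in this post; the data, the grid, and cv=2 are assumptions chosen only to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up reviews, for illustration only
texts = ["great food", "lovely staff", "great service", "lovely place",
         "awful food", "rude staff", "awful service", "dirty place"]
labels = [5, 5, 5, 5, 1, 1, 1, 1]

toy_model = Pipeline([
    ('counts', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# refit=True (the default) retrains the best combination on all data passed to fit
toy_search = GridSearchCV(toy_model, param_grid={'clf__C': [0.1, 1.0]}, cv=2)
toy_search.fit(texts, labels)

print(toy_search.best_params_)
# best_estimator_ is a fitted Pipeline that can be used directly
print(toy_search.best_estimator_.predict(["great place"]))
```

Applied to the model in this post, grid_search.best_estimator_.predict(X_test) would play the role of the manually rebuilt model.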